The World Happiness Report is a landmark survey of the state of global happiness. The report we use, taken from Kaggle, contains the data for the year 2019. The report has gained global recognition as governments, organizations, and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy, and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. The happiness scores and rankings use data from the Gallup World Poll and are based on answers to the main life evaluation question asked in the poll.
The data was taken from Kaggle(https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2019.csv).
The goal of our project is to investigate the correlation between the happiness indicators and the happiness score, and to find the model that best predicts a country's happiness score (judged by the lowest RMSE). We use Score as our response variable and all other variables except country, overall rank, and region as explanatory variables; the region and the overall rank are not needed to accomplish our main goal. We have thoroughly investigated the data set and created summary statistics and visualizations of the important explanatory variables. There are no missing values in our data set. We have also created a correlation plot to check whether any explanatory variables are strongly correlated.
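The code below assumes the packages and data have already been loaded; a minimal setup sketch (the file name `2019.csv` matches the Kaggle download, and the package names are those used later in the analysis):

```r
# Packages used throughout the analysis
library(tidyverse)  # data wrangling and ggplot2
library(caret)      # cross-validation and model training
library(recipes)    # preprocessing blueprints
library(corrplot)   # correlation plots
library(heatmaply)  # interactive heat maps
library(ipred)      # bagging()
library(ranger)     # random forests
library(rpart)      # rpart.control()

# Read the 2019 World Happiness Report data downloaded from Kaggle;
# read.csv() converts the spaces in the column names to dots,
# e.g. "Overall rank" becomes Overall.rank
happy <- read.csv("2019.csv")
```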
set.seed(05292022) # set seed for cross validation
summary(happy) # checking for missing data
## Overall.rank Country.or.region Score GDP.per.capita
## Min. : 1.00 Length:156 Min. :2.853 Min. :0.0000
## 1st Qu.: 39.75 Class :character 1st Qu.:4.545 1st Qu.:0.6028
## Median : 78.50 Mode :character Median :5.380 Median :0.9600
## Mean : 78.50 Mean :5.407 Mean :0.9051
## 3rd Qu.:117.25 3rd Qu.:6.184 3rd Qu.:1.2325
## Max. :156.00 Max. :7.769 Max. :1.6840
## Social.support Healthy.life.expectancy Freedom.to.make.life.choices
## Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.056 1st Qu.:0.5477 1st Qu.:0.3080
## Median :1.272 Median :0.7890 Median :0.4170
## Mean :1.209 Mean :0.7252 Mean :0.3926
## 3rd Qu.:1.452 3rd Qu.:0.8818 3rd Qu.:0.5072
## Max. :1.624 Max. :1.1410 Max. :0.6310
## Generosity Perceptions.of.corruption
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1087 1st Qu.:0.0470
## Median :0.1775 Median :0.0855
## Mean :0.1848 Mean :0.1106
## 3rd Qu.:0.2482 3rd Qu.:0.1412
## Max. :0.5660 Max. :0.4530
# visualization plots for selected variables
ggplot(data = happy, aes(x = GDP.per.capita, y = Score)) + geom_point() + ggtitle("GDP per capita vs. Happiness Score") + geom_smooth(method = "lm")
ggplot(data = happy, aes(x = Generosity, y = Score)) + geom_point() + ggtitle("Generosity score vs. Happiness Score") + geom_smooth(method = "lm")
ggplot(data = happy, aes(x=Score)) + geom_histogram() + ggtitle("Histogram of Happiness Scores")
happy1 <- happy %>% dplyr::select(-Country.or.region, -Overall.rank) # dropping the country/region and overall rank variables
#correlation plot
corr_mat <- cor(happy1) # correlation matrix (named corr_mat to avoid masking base plot())
corrplot(corr_mat)
# filtering for visualizing data for some major countries
data <- happy %>%
filter(Country.or.region %in% c("France", "Sweden", "Italy", "Spain", "England", "Portugal", "Greece", "Peru", "Chile", "Brazil", "Argentina", "Bolivia", "Venezuela", "Australia", "New Zealand", "Fiji", "China", "India", "Thailand", "Afghanistan", "Bangladesh", "United States of America", "Canada", "Burundi", "Angola", "Kenya", "Togo")) %>%
arrange(Country.or.region) %>%
mutate(Country = factor(Country.or.region, Country.or.region))
# Matrix format
mat <- data
rownames(mat) <- mat[,2]
mat <- mat %>% dplyr::select(-Country.or.region, -Overall.rank,-Country)
# Heat map for the major countries including all variables
p <- heatmaply(mat,
dendrogram = "none",
xlab = "", ylab = "",
main = "",
scale = "column",
margins = c(60,100,40,20),
grid_color = "white",
grid_width = 0.00001,
titleX = FALSE,
hide_colorbar = TRUE,
branches_lwd = 0.1,
label_names = c("Country", "Feature:", "Value"),
fontsize_row = 5, fontsize_col = 5,
labCol = colnames(mat),
labRow = rownames(mat),
heatmap_layers = theme(axis.line=element_blank())
)
p
In this plot we can see the values of our explanatory variables for some major countries. We can hover over this interactive map and see the specific values for each variable.
summary(happy1)
## Score GDP.per.capita Social.support Healthy.life.expectancy
## Min. :2.853 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:4.545 1st Qu.:0.6028 1st Qu.:1.056 1st Qu.:0.5477
## Median :5.380 Median :0.9600 Median :1.272 Median :0.7890
## Mean :5.407 Mean :0.9051 Mean :1.209 Mean :0.7252
## 3rd Qu.:6.184 3rd Qu.:1.2325 3rd Qu.:1.452 3rd Qu.:0.8818
## Max. :7.769 Max. :1.6840 Max. :1.624 Max. :1.1410
## Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3080 1st Qu.:0.1087 1st Qu.:0.0470
## Median :0.4170 Median :0.1775 Median :0.0855
## Mean :0.3926 Mean :0.1848 Mean :0.1106
## 3rd Qu.:0.5072 3rd Qu.:0.2482 3rd Qu.:0.1412
## Max. :0.6310 Max. :0.5660 Max. :0.4530
sum(is.na(happy1)) # check missing values
## [1] 0
str(happy1) #checking for variable types
## 'data.frame': 156 obs. of 7 variables:
## $ Score : num 7.77 7.6 7.55 7.49 7.49 ...
## $ GDP.per.capita : num 1.34 1.38 1.49 1.38 1.4 ...
## $ Social.support : num 1.59 1.57 1.58 1.62 1.52 ...
## $ Healthy.life.expectancy : num 0.986 0.996 1.028 1.026 0.999 ...
## $ Freedom.to.make.life.choices: num 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
## $ Generosity : num 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
## $ Perceptions.of.corruption : num 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
train_index <- createDataPartition(happy1$Score, p = 0.8, list = FALSE) # 80-20 split for creating training and test datasets
train <- happy1[train_index, ] # training set
test <- happy1[-train_index, ] # test set
nearZeroVar(train, saveMetrics = TRUE) # check which predictors are zv/nzv
## freqRatio percentUnique zeroVar nzv
## Score 2.000000 99.21875 FALSE FALSE
## GDP.per.capita 1.500000 94.53125 FALSE FALSE
## Social.support 1.500000 93.75000 FALSE FALSE
## Healthy.life.expectancy 1.000000 77.34375 FALSE FALSE
## Freedom.to.make.life.choices 1.000000 83.59375 FALSE FALSE
## Generosity 1.333333 78.12500 FALSE FALSE
## Perceptions.of.corruption 1.333333 78.12500 FALSE FALSE
# create feature preprocessing blueprint
blueprint <- recipe(Score ~ ., data = happy1) %>%
step_normalize(all_numeric_predictors()) # scaling and normalizing all numeric variables
# preparing the blueprint
prepare <- prep(blueprint, training = train)
baked_train <- bake(prepare, new_data = train)
baked_test <- bake(prepare, new_data = test)
We have created training and test data sets by splitting the original data 80-20 (80% is the training data set). Considering that we only have 156 observations, we chose an 80-20 split so the training set is larger than it would be under a 70-30 split. We first investigated the data set for missing values and near-zero/zero-variance features; there are no missing values and no zv/nzv features. We have no categorical features, so no label encoding is needed. We normalized all of our numerical variables to get an appropriate blueprint. After completing all of these steps we prepared the blueprint and baked our training and test data sets with it.
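As a quick sanity check (a sketch, using only the objects created above), the split sizes should be roughly 80/20 and the baked training predictors should be centered and scaled, since `step_normalize()` standardizes each column:

```r
# Split sizes: about 80% of the 156 rows in train, the rest in test
nrow(train)
nrow(test)

# After baking, every predictor should have mean ~0 and sd ~1
# on the training data (Score, the response, is left untouched)
round(colMeans(baked_train %>% dplyr::select(-Score)), 3)
round(sapply(baked_train %>% dplyr::select(-Score), sd), 3)
```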
library(GGally)
ggpairs(happy1, title = "Correlogram") # pairwise plots to examine possible relationships between the variables
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5) # setting cv specifications for all of the models
# KNN regression
k_grid <- expand.grid(k = seq(1, 20, by = 1)) # specifying the tuning parameters for knn regression
# implementing cv to find optimal k value for the knn regression
knn_fit <- train(blueprint,
data = train,
method = "knn",
trControl = cv_specs,
tuneGrid = k_grid,
metric = "RMSE")
knn_fit
## k-Nearest Neighbors
##
## 128 samples
## 6 predictor
##
## Recipe steps: normalize
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 102, 103, 101, 103, 103, 103, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.6716856 0.6727324 0.5383846
## 2 0.5745947 0.7367814 0.4557909
## 3 0.5355136 0.7711673 0.4268523
## 4 0.5253204 0.7788887 0.4103186
## 5 0.5207239 0.7849988 0.4067351
## 6 0.5190942 0.7890778 0.4110125
## 7 0.5137864 0.7943554 0.4043546
## 8 0.5122487 0.7979665 0.4027064
## 9 0.5176533 0.7950370 0.4046804
## 10 0.5232152 0.7928953 0.4105461
## 11 0.5316985 0.7876189 0.4168114
## 12 0.5408344 0.7815651 0.4242502
## 13 0.5484834 0.7777412 0.4306598
## 14 0.5520855 0.7761107 0.4334259
## 15 0.5576345 0.7721855 0.4361822
## 16 0.5617336 0.7698495 0.4402463
## 17 0.5611915 0.7721830 0.4381756
## 18 0.5642499 0.7722410 0.4417881
## 19 0.5676930 0.7707514 0.4445620
## 20 0.5700658 0.7708918 0.4481605
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 8.
ggplot(knn_fit)
min(knn_fit$results$RMSE) # cv RMSE for knn regression
## [1] 0.5122487
knn_fit$results
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.6716856 0.6727324 0.5383846 0.08901254 0.08509549 0.07826967
## 2 2 0.5745947 0.7367814 0.4557909 0.08811573 0.08649872 0.07395478
## 3 3 0.5355136 0.7711673 0.4268523 0.08551245 0.06905233 0.07309664
## 4 4 0.5253204 0.7788887 0.4103186 0.08089025 0.06924853 0.06188832
## 5 5 0.5207239 0.7849988 0.4067351 0.07442011 0.06458766 0.05804139
## 6 6 0.5190942 0.7890778 0.4110125 0.07045122 0.06325507 0.05978091
## 7 7 0.5137864 0.7943554 0.4043546 0.07458523 0.06421677 0.06281384
## 8 8 0.5122487 0.7979665 0.4027064 0.07459189 0.06502967 0.06591925
## 9 9 0.5176533 0.7950370 0.4046804 0.06985044 0.06208819 0.06218902
## 10 10 0.5232152 0.7928953 0.4105461 0.07154932 0.06316004 0.05966839
## 11 11 0.5316985 0.7876189 0.4168114 0.06659227 0.06070177 0.05531366
## 12 12 0.5408344 0.7815651 0.4242502 0.06487918 0.06091166 0.05517103
## 13 13 0.5484834 0.7777412 0.4306598 0.06465006 0.05795219 0.05202640
## 14 14 0.5520855 0.7761107 0.4334259 0.06188118 0.05784074 0.04967546
## 15 15 0.5576345 0.7721855 0.4361822 0.06453926 0.05754519 0.04866091
## 16 16 0.5617336 0.7698495 0.4402463 0.06452682 0.06154079 0.04839653
## 17 17 0.5611915 0.7721830 0.4381756 0.06466809 0.05868196 0.04817699
## 18 18 0.5642499 0.7722410 0.4417881 0.06551335 0.06151041 0.04924281
## 19 19 0.5676930 0.7707514 0.4445620 0.06433729 0.06114380 0.04727939
## 20 20 0.5700658 0.7708918 0.4481605 0.06290682 0.06053865 0.04657724
# MLR model
# implementing cv to find RMSE for linear regression model
lm_fit <- train(blueprint,
data = train,
method = "lm",
trControl = cv_specs,
metric = "RMSE")
lm_fit
## Linear Regression
##
## 128 samples
## 6 predictor
##
## Recipe steps: normalize
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 104, 102, 103, 100, 103, 103, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.5105353 0.7908418 0.4009258
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Ridge regression model
lambda_grid <- 10^seq(-3, 3, length = 100) # grid of lambda values to search over
ridge_cv <- train(blueprint,
data = train,
method = "glmnet", # for ridge regression
trControl = cv_specs,
tuneGrid = expand.grid(alpha = 0, lambda = lambda_grid), # alpha = 0 implements ridge regression
metric = "RMSE")
# results from the CV procedure
ridge_cv$bestTune # optimal lambda
## alpha lambda
## 37 0 0.1519911
min(ridge_cv$results$RMSE) # RMSE for optimal lambda
## [1] 0.5020319
# The LASSO model
# implement CV
lasso_cv <- train(blueprint,
data = train,
method = "glmnet", # for lasso
trControl = cv_specs,
tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid), # alpha = 1 implements lasso
metric = "RMSE")
# results from the CV procedure
lasso_cv$bestTune # optimal lambda
## alpha lambda
## 11 1 0.004037017
min(lasso_cv$results$RMSE) # RMSE for optimal lambda
## [1] 0.5125943
# MARS model
param_grid <- expand.grid(degree = 1:3, nprune = seq(1, 100, length.out = 10)) # setting parameters for mars model
# implement CV
mars_cv <- train(blueprint,
data = train,
method = "earth",
trControl = cv_specs,
tuneGrid = param_grid,
metric = "RMSE")
mars_cv$bestTune # optimal tuning parameters
## nprune degree
## 2 12 1
min(mars_cv$results$RMSE) # CV RMSE
## [1] 0.5141464
# Tree model
#implement cv
tree_cv <- train(blueprint,
data = train,
method = "rpart",
trControl = cv_specs,
tuneLength = 20,
metric = "RMSE")
tree_cv$bestTune # optimal tuning parameter
## cp
## 1 0
min(tree_cv$results$RMSE) # CV RMSE
## [1] 0.5913842
# Bagging
bag_fit <- bagging(Score ~ .,
data = baked_train,
nbagg = 500, # number of trees to grow (bootstrap replications)
coob = TRUE, # yes to computing OOB error estimate
control = rpart.control(minsplit = 2, cp = 0, xval = 0)) # details of each tree
bag_fit #results for bagging
##
## Bagging regression trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Score ~ ., data = baked_train, nbagg = 500,
## coob = TRUE, control = rpart.control(minsplit = 2, cp = 0,
## xval = 0))
##
## Out-of-bag estimate of root mean squared error: 0.5025
# Random Forest model
param_grid <- expand.grid(mtry = seq(1, 6, 1), # sequence of 1 to number of predictors (6)
splitrule = "variance",
min.node.size = 2) # for each tree
# implement cv
rf_cv <- train(blueprint,
data = train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid,
metric = "RMSE")
rf_cv$bestTune$mtry # optimal tuning parameter
## [1] 2
min(rf_cv$results$RMSE) # CV RMSE
## [1] 0.4867644
Our response variable is numerical, so we chose to implement several regression models. We did not fit a GAM, because cross-validating that method was beyond our scope, and we did not use polynomial terms in the MLR, since the correlation plots show no polynomial trend in any of the variables. We implemented CV for KNN, MLR, ridge, lasso, MARS, a single regression tree, and random forest, and we also fit a bagging model with nbagg = 500. For our CV specifications we used 5-fold cross-validation repeated 5 times. We investigated the optimal tuning parameters for each model and the RMSE at those parameters. The table below shows the results of our models.
| Model | CV RMSE | Optimal tuning parameters |
|---|---|---|
| KNN | 0.512 | k = 8 |
| MLR | 0.510 | N/A |
| Ridge | 0.502 | alpha = 0, lambda = 0.152 |
| LASSO | 0.512 | alpha = 1, lambda = 0.004 |
| MARS | 0.514 | nprune = 12, degree = 1 |
| Single regression tree | 0.591 | cp = 0 |
| Bagging | 0.503 (OOB estimate) | nbagg = 500 |
| Random forest | 0.487 | mtry = 2 |
From the table we can see that the optimal model is the random forest, with the lowest CV RMSE of 0.487 at the optimal mtry = 2.
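The comparison in the table can also be assembled programmatically from the fitted objects created above (a sketch; note that the bagging model reports an out-of-bag estimate rather than a CV RMSE, so it is listed separately):

```r
# Collect the minimum CV RMSE from each caret fit into one table
results <- tibble::tibble(
  Model = c("KNN", "MLR", "Ridge", "LASSO", "MARS",
            "Single regression tree", "Random forest"),
  CV_RMSE = c(min(knn_fit$results$RMSE),
              min(lm_fit$results$RMSE),
              min(ridge_cv$results$RMSE),
              min(lasso_cv$results$RMSE),
              min(mars_cv$results$RMSE),
              min(tree_cv$results$RMSE),
              min(rf_cv$results$RMSE))
)
results %>% dplyr::arrange(CV_RMSE)  # random forest has the lowest CV RMSE
bag_fit$err                          # OOB RMSE estimate for bagging
```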
# final model using rf
rf_fit <- ranger(Score ~ .,
data = baked_train,
num.trees = 500,
mtry = rf_cv$bestTune$mtry, # using best tuning parameter from cv rf
splitrule = "variance",
min.node.size = 2,
importance = "impurity")
# variable importance
sort(rf_fit$variable.importance, decreasing = TRUE)
## Social.support Healthy.life.expectancy
## 45.765635 40.436350
## GDP.per.capita Freedom.to.make.life.choices
## 37.375213 12.895007
## Perceptions.of.corruption Generosity
## 10.792288 5.964881
preds_rf <- predict(rf_fit, data = baked_test, type = "response") # predictions on test set
sqrt(mean((preds_rf$predictions - baked_test$Score)^2)) # test set RMSE
## [1] 0.6027718
Based on our CV results we chose the random forest with mtry = 2 as our final model. We fit this model on our training data and examined the importance of the variables; the most important variable was social support. We then obtained predictions from the model on our test data set. The test RMSE (test set error) for the final model was 0.603.
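The importance scores above can also be shown graphically; a minimal base-R sketch using the fitted `rf_fit` object (a base `barplot()` is used here to avoid any extra package dependency):

```r
# Horizontal bar chart of the impurity-based variable importance,
# sorted so the most important variable appears at the top
imp <- sort(rf_fit$variable.importance, decreasing = FALSE)
par(mar = c(4, 12, 2, 1))  # widen the left margin for the long labels
barplot(imp, horiz = TRUE, las = 1,
        main = "Random forest variable importance",
        xlab = "Impurity-based importance")
```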